Abstract. Gamma-ray bursts provide what is probably one of the messiest of
all astrophysical data sets. Burst class properties are indistinct, as overlapping
characteristics of individual bursts are convolved with effects of instrumental and
sampling biases. Despite these complexities, data mining techniques have yielded
new insights into gamma-ray burst data. We demonstrate how data
mining techniques have simultaneously allowed us to learn about gamma-ray burst
detectors and data collection, cosmological effects in burst data, and properties of
burst subclasses. We discuss the exciting future of this field, and the web-based tool
we are developing (with support from the NASA AISR Program). We invite others
to join us in AI-guided gamma-ray burst classification (http://grb.mnsu.edu/grb/).



1 Introduction 

Understanding the physics of a class of astronomical objects depends on 
identifying intrinsic behaviors. When two or more subclasses are present, 
each subclass is defined in terms of its own intrinsic behaviors. The process 
of identifying behavioral characteristics is difficult when the objects' observed 
characteristics (or attributes) overlap. Such is the case for cosmic gamma-ray 
bursts (GRBs), which have a large spread in observed attribute values. Some 
GRB attribute dispersion is intrinsic, some is caused by measurement error, 
some is due to systematic (e.g. instrumental and sampling) biases, and some 
is caused by the presence of subclasses. 

GRB subclass behaviors are difficult to delineate from other causes of
attribute dispersion. Two GRB subclasses have been known to exist for some
time, but it has been difficult to assign individual GRBs to a class
because of attribute overlap. Class assignment has been complicated even
more by the statistical clustering identification of a third GRB subclass;
properties of this third subclass overlap those of the other two.

GRB classification can be aided by Knowledge Discovery in Databases
(KDD). The approach uses pattern recognition algorithms from the
Artificial Intelligence (AI) branch of computer science to find behaviors
indicative of subclasses. KDD offers a methodology by which meaningful
information can be extracted from large volumes of data. The KDD process
(Figure 1) is composed of data pre-processing and storage (data warehousing),
data mining (clustering software), and scientific/logical assessment.
Statistical and systematic effects (e.g. instrumentation and sampling biases)
can be identified and even removed in the assessment step.

Fig. 1. Gamma-Ray Burst Classification Process



AI classifiers are typically supervised or unsupervised. Supervised
classifiers require training instances (data elements) in order to develop
classification rules for unknown instances. Unsupervised classifiers try to
subclassify a data set by searching for clusters in multidimensional attribute
spaces.

We are developing a web-based tool for the classification of GRB
data (http://grb.mnsu.edu/grb/). The tool contains a preprocessed GRB
database, AI classifiers, and data visualization software. In this manuscript
we describe some of our initial scientific results concerning GRB data mining
with this tool. Additional results have been published elsewhere.



2 Support for the Existence of Three GRB Subclasses 

Statistical clustering analysis has revealed the presence of a third GRB
subclass for BATSE 3B data. Three major attributes delineate the three
classes: S23 fluence (time-integrated flux in the 50 to 300 keV range), T90
duration (time interval during which 90% of the burst's emission is received),
and HR321 hardness ratio (the fluence in the 100 to 300 keV band divided by
the fluence in the 25 to 100 keV band). The properties of the three subclasses
are summarized in Table 1.
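
The two fluence-based attributes defined above can be illustrated with a short sketch. It assumes three energy channels with boundaries at 25, 50, 100, and 300 keV, which is implied by the band definitions in the text; the `ChannelFluences` container, the function names, and the numerical values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChannelFluences:
    """Fluences in three energy channels (hypothetical container, erg cm^-2)."""
    ch1: float  # 25 to 50 keV (assumed channel boundary)
    ch2: float  # 50 to 100 keV (assumed channel boundary)
    ch3: float  # 100 to 300 keV


def s23(f: ChannelFluences) -> float:
    """S23: fluence in the 50 to 300 keV range (channels 2 and 3 summed)."""
    return f.ch2 + f.ch3


def hr321(f: ChannelFluences) -> float:
    """HR321: 100-300 keV fluence divided by the 25-100 keV fluence."""
    return f.ch3 / (f.ch1 + f.ch2)


# Invented values, not taken from any catalog.
example = ChannelFluences(ch1=1.0e-7, ch2=2.0e-7, ch3=6.0e-7)
print("S23   =", s23(example))
print("HR321 =", hr321(example))
```

A larger HR321 means a spectrally harder burst, which is the sense in which Class 2 bursts are labeled "hard" in Table 1.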

We examine the viability of these subclasses using the decision tree
classifier C4.5. A decision tree is a supervised classifier that develops rules
by sorting through training instances via a series of branching tests. The
results of the tests are turned into IF-THEN-ELSE statements.
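
As a rough illustration of how one branching test becomes an IF-THEN-ELSE rule, the sketch below finds the single threshold on one attribute that maximizes information gain. This is a one-level stand-in and not C4.5 itself, which builds multi-level trees over many attributes. The attribute name `log_T90`, the function names, and the toy data are all invented for the example.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        threshold = (lo + hi) / 2.0
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best[1]:
            best = (threshold, base - remainder)
    return best


def majority(labels):
    """Most common label in a non-empty list."""
    return max(set(labels), key=labels.count)


def stump_rule(values, labels, attribute="log_T90"):
    """Express the best single split as an IF-THEN-ELSE statement."""
    threshold, _ = best_threshold(values, labels)
    left = majority([lab for v, lab in zip(values, labels) if v <= threshold])
    right = majority([lab for v, lab in zip(values, labels) if v > threshold])
    return f"IF {attribute} <= {threshold:.2f} THEN class = {left} ELSE class = {right}"


# Invented toy training instances: log10 of duration with a class label.
log_t90 = [-0.8, -0.5, -0.2, 0.1, 1.0, 1.4, 1.7, 2.1]
classes = ["short"] * 4 + ["long"] * 4
print(stump_rule(log_t90, classes))
```

A full decision tree applies the same kind of test recursively to each branch until the training instances in a branch are sufficiently pure.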

Table 1. Statistical clustering classes, from 3B GRBs.

Attribute   Class 1 (Long)   Class 2 (Short)   Class 3 (Intermediate)
T90         long             short             intermediate
S23         bright           faint             intermediate
HR321       intermediate     hard              soft

We use C4.5 to demonstrate a new data visualization technique we call
"Fuzzy Controlled Learning" (or FCL). FCL helps users to visualize the
attribute space in which subclasses reside, while recognizing that the
subclass distributions overlap in this space. FCL is best used when a
principal attribute is available that serves as a performance indicator. We
assume for this analysis that T90 duration is the principal attribute, since
the longest and shortest GRBs have quite different characteristics.

We withhold 50 GRBs from the long and short ends of the BATSE 3B
T90 distribution as "comparison" GRBs. These GRBs are considered to have
attributes (e.g. fluences and hardnesses) most indicative of the long and short
subclasses. Initially, 50 long and 50 short GRBs from the remaining data are
used as training instances for C4.5. C4.5 produces first a decision tree and
then a rule set for classifying these GRBs. The rules are applied to the
comparison bursts, and the rule accuracy is determined from the result. On each
subsequent application, training instances are selected farther from the ends
of the T90 distribution; rule accuracies are determined for each training set.
The accuracies indicate how closely GRBs in that particular region of the
attribute space compare to those in the comparison set; a score near 100%
indicates that the training set is indistinguishable from the comparison set,
while a score near 50% indicates that C4.5 could only guess at subclass
characteristics.
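
The sweep described above can be sketched as a double loop over training-window positions. This is a minimal sketch under stated assumptions: a nearest-centroid classifier stands in for C4.5, the window size and step are arbitrary, training windows are allowed to overlap, and the function names and toy data are invented for illustration.

```python
def centroid(rows):
    """Mean of each attribute over a list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def nearest(row, centroids):
    """Label of the centroid closest (squared Euclidean distance) to row."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])))


def fcl_grid(bursts, n=50, step=50):
    """Return {(short_offset, long_offset): accuracy on the comparison set}.

    bursts: list of (t90, attributes) tuples; attributes exclude T90 itself.
    """
    ordered = sorted(bursts, key=lambda b: b[0])
    # Withhold n bursts from each end of the T90 distribution for comparison.
    comparison = [(attrs, "short") for _, attrs in ordered[:n]]
    comparison += [(attrs, "long") for _, attrs in ordered[-n:]]
    pool = ordered[n:-n]
    grid = {}
    for i in range(0, len(pool) - n + 1, step):      # distance from the short end
        for j in range(0, len(pool) - n + 1, step):  # distance from the long end
            short_train = [attrs for _, attrs in pool[i:i + n]]
            long_train = [attrs for _, attrs in pool[len(pool) - j - n:len(pool) - j]]
            centroids = {"short": centroid(short_train), "long": centroid(long_train)}
            hits = sum(nearest(attrs, centroids) == label for attrs, label in comparison)
            grid[(i, j)] = hits / len(comparison)
    return grid


# Invented, deterministic toy data: 400 bursts whose single attribute
# (a stand-in hardness) falls smoothly as T90 rises.
toy = [(float(k + 1), (1.0 - k / 400.0,)) for k in range(400)]
grid = fcl_grid(toy)
print(grid[(0, 0)], grid[(250, 250)])
```

Contouring the values in `grid` against the two offsets gives a plot of the same kind as the FCL contour plot. Accuracies well below 50% arise when the training windows have moved so far that their labels are effectively swapped relative to the comparison set.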

Fig. 2. FCL contour plot. Contours indicate the following agreements between
training and comparison data: 90% (dark), 80%, 70%, 50%, and 30% (light).

Figure 2 is the FCL contour plot of these rule accuracies. The vertical
axis is the distance of the long training cluster center (in units of numbers of
GRBs) from the longest GRB, while the horizontal axis is the distance of the
short training cluster center from the shortest GRB. The darkest contours
(accuracies > 70%) near the x-axis indicate that there are several hundred
GRBs with long burst characteristics; there are far fewer clearly defined short
GRBs near the y-axis. Interestingly, the lightest contours (accuracies < 30%)
occur roughly 200 GRBs from the short end and 350 GRBs from the long
end (corresponding to T90s between 4.5 seconds and 16.8 seconds). GRBs in
this T90 range have characteristics dissimilar to both long and short bursts;
this failure of the two-subclass hypothesis to explain the GRB data supports
the existence of a third (intermediate) subclass.

3 Is Each Subclass a Separate Source Population? 

C4.5 is subsequently trained on the three GRB classes defined from
BATSE 3B data [9]. Several GRBs are found to have peculiar hardness ratios
which result from large individual channel fluence errors. The 10% of GRBs
with the largest relative errors (error divided by measurement) are removed,
and the remaining 3B GRBs are reclassified using C4.5. The resulting rules are
used to classify 4B Catalog GRBs and thus increase the database size.
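
The relative-error cut described above can be sketched as follows. The record layout `(name, measurement, error)` and the function name are hypothetical, and the text does not specify which quantity's error was ranked, so the sketch simply ranks whatever measurement and error are supplied.

```python
def drop_largest_relative_errors(records, fraction=0.10):
    """Remove the given fraction of records with the largest |error / measurement|.

    records: list of (name, measurement, error) tuples with nonzero measurements.
    The original ordering of the surviving records is preserved.
    """
    n_drop = int(fraction * len(records))
    # Rank record indices from worst (largest relative error) to best.
    order = sorted(range(len(records)),
                   key=lambda k: abs(records[k][2] / records[k][1]),
                   reverse=True)
    dropped = set(order[:n_drop])
    return [rec for k, rec in enumerate(records) if k not in dropped]


# Invented toy records: ten bursts with relative errors from 1% to 10%.
records = [("g%d" % k, 1.0, 0.01 * (k + 1)) for k in range(10)]
kept = drop_largest_relative_errors(records)
print([name for name, _, _ in kept])
```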

With the larger classification database, the spectral hardness dependence
is examined in terms of the spectral fitting parameters α, β, and Epeak. Using
only these three attributes, C4.5 accurately classifies most of the 4B GRBs.
The resulting rules separate Class 2 from Class 1, but cannot delineate
Class 3 from Class 1 (85% of Class 3 GRBs are assigned to Class 1). Class
3 GRBs are found to have Epeak values similar to Class 1 bursts of the
same peak flux (Figure 3). The correlation between Epeak and peak flux
appears to be due to cosmological redshift. Since one of the three defining
Class 3 characteristics is a data correlation, we hypothesize that instrumental
and/or sampling biases can cause some Class 1 GRBs to take on Class 3 values
(e.g. some Class 1 GRBs might appear shorter and fainter than expected).

Figure 4 is a plot of fluence vs. 1024 ms peak flux for each of the three
subclasses, limited to GRBs detected with one homogeneous set of BATSE
trigger criteria. There are distinct bounds outside of which no GRBs are
found. GRBs with 1024 ms peak fluxes less than 0.2 photons cm^-2 sec^-1
are not detected, since this is BATSE's minimum detection threshold. GRBs
do not have fluences less than their time-integrated 1024 ms peak fluxes,
establishing a lower fluence limit.

Fig. 3. Epeak vs. p1024 for the Three GRB Classes (panel title: 3 Classes:
4B Catalog; x = Class 1, * = Class 2; horizontal axis: 1024 ms peak flux in
photon cm^-2 sec^-1).

Fig. 4. Fluence vs. p1024 for the Three GRB Classes.

Figure 5 overlays log(T90) contours for Class 1 GRBs on the fluence
vs. 1024 ms peak flux space. The contours demonstrate that GRBs can be
modeled as a series of pulses, with pulses containing most of the fluence and
interpulse separations primarily defining the duration. Most Class 2 bursts
are single-pulsed events as measured on the 1024 ms timescale. This helps
define the characteristics of the third distinct region outside of which no
GRBs are found: high-fluence, faint Class 1 GRBs are missing, whereas
low-fluence, faint Class 1 GRBs are present. A bias favoring detection of GRBs
with few photons over those with many photons seems unlikely, so we suspect
a bias that removes Class 1 fluence relative to peak flux.

We have dimmed five temporally different GRBs through ten peak fluxes 
in order to study their measured properties as they fade into background. 
Each dimmed GRB's time history is Poisson "noisified," then the peak flux 
and fluence are re-measured. 

Fig. 5. S23 vs. p1024 for Class 1 GRBs; contours indicate constant log(T90) regions.

The time interval bounding the fluence measurement (the fluence duration)
appears to strongly influence the amount of fluence measured. If the same
fluence duration interval is used for undimmed and dimmed measurements,
then the fluence-to-peak flux ratio does not change as a GRB is dimmed.
If, however, the fluence duration interval is shortened to account for faint
pulses disappearing into the background and becoming unrecognizable, then
the fluence-to-peak flux ratio decreases as the burst is dimmed (see Figure 6).
This bias is stronger near trigger threshold.

Fluence durations taken from BATSE Catalogs provide supporting evidence
for this mechanism (see Figure 7). Fluence durations of faint Class 1
GRBs are shorter than those of bright Class 1 GRBs.
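
The fluence duration bias described above can be illustrated with a deterministic toy calculation. This sketch omits the Poisson noise step and works on background-subtracted counts; the pulse profile, the recognition threshold of 10 counts per bin, and the function name are invented for illustration and are not the procedure used to produce the figures.

```python
def fluence_to_peak(counts, threshold=None):
    """Ratio of summed counts (a fluence proxy) to the peak bin (a peak flux proxy).

    With threshold=None the full, fixed interval is summed. Otherwise the
    fluence interval shrinks to span only the first through last bins that
    are at or above the threshold, mimicking faint pulses lost in background.
    """
    peak = max(counts)
    if threshold is None:
        window = counts
    else:
        visible = [k for k, c in enumerate(counts) if c >= threshold]
        if not visible:
            return 0.0  # nothing is recognizable above background
        window = counts[visible[0]:visible[-1] + 1]
    return sum(window) / peak


# Invented background-subtracted counts per 1.024 s bin for a multi-pulse burst.
profile = [40, 400, 60, 0, 0, 80, 30, 0, 20]
scales = [1.0, 0.3, 0.1]  # successive dimming factors

fixed = [fluence_to_peak([s * c for c in profile]) for s in scales]
shrunk = [fluence_to_peak([s * c for c in profile], threshold=10) for s in scales]

for s, a, b in zip(scales, fixed, shrunk):
    print(f"scale {s:4.1f}: fixed interval {a:.3f}, shortened interval {b:.3f}")
```

With the fixed interval the ratio is unchanged by dimming, while with the shortened interval it falls as trailing pulses drop below the threshold, which is the sense of the bias proposed for faint Class 1 GRBs.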




Fig. 6. Fluences and peak fluxes of five decremented and noisified Class 1 GRBs
(durations taken from identifiable pulses). Horizontal axis: 1024 ms peak flux
(photon cm^-2 sec^-1).

Fig. 7. Fluence Durations of Class 1 GRBs.



4 Conclusions 

We have demonstrated that data mining techniques can aid the interpretation
of scientific data, even with complex and ambiguous GRB data. Data
mining demonstrates that some Class 1 (Long) GRBs can develop Class 3
(Intermediate) characteristics via a combination of the hardness-intensity
relation and the fluence duration bias. Class 3 (Intermediate) GRBs do not
appear to represent a separate source population, although they cluster in
the duration, fluence, and hardness attribute space. Class 2 (Short) GRBs do
appear to represent a separate source population.